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Abstract We revise the Bayesian inference steps required to analyse the cosmo- 
logical large-scale structure. Here we make special emphasis in the complications 
which arise due to the non-Gaussian character of the galaxy and matter distribution. 
In particular we investigate the advantages and limitations of the Poisson-lognormal 
model and discuss how to extend this work. With the lognormal prior using the 
Hamiltonian sampling technique and on scales of about 4 h^^ Mpc we find that the 
over-dense regions are excellent reconstructed, however, under-dense regions (void 
statistics) are quantitatively poorly recovered. Contrary to the maximum a posteriori 
(MAP) solution which was shown to over-estimate the density in the under-dense 
regions we obtain lower densities than in N-body simulations. This is due to the 
fact that the MAP solution is conservative whereas the full posterior yields samples 
which are consistent with the prior statistics. The lognormal prior is not able to cap- 
ture the full non-linear regime at scales below ~ 10 h^^ Mpc for which higher order 
correlations would be required to describe the matter statistics. However, we confirm 
as it was recently shown in the context of Lya forest tomography that the Poisson- 
lognormal model provides the correct two-point statistics (or power-spectrum). 



1 Introduction 

The cosmological large-scale structure encodes a wealth of information about the 
origin and evolution of the Universe. A careful study of the cosmic structure can 
thus lead to a deeper understanding on structure formation and unveil the cosmo- 
logical parameters to unprecedented accuracy. However, the data are plagued by 
many observational effects like the mask and selection function of the particular 
surveys and the bias related to the matter tracer (e. g. galaxies). It is thus clear that a 
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Statistical treatment is necessary to compare observations with theory and perform 
a detailed study of structure formation. 

In this report we focus on the systematic effects which arise from observed data 
and present in detail the Bayesian approach (Section |2]i to describe the statistics of 
the data and the large-scale structure. 

Finally, we show some numerical experiments which unveil the state-of-the-art 
in the field and the problems which should be addressed in future work. 



2 Bayesian approach 

The evolution from a Gaussian homogeneous Universe to a complex non-linear and 
non-Gaussian cosmic web can be accurately modeled with A^-body simulations (see 
e. g. lHI). Hence we can test the different statistical models describing the nature of 
the matter distribution simplifying our model selection process. 

In this context a Bayesian approach is ideal as it clearly incorporates the assump- 
tions in form of conditional probability distribution functions (PDFs) making a 
distinction between the model for the observed/measured data d represented by the 
likelihood and the model for the seeked signal s represented by the prior (P(d|s,p) 
and f (s|p), respectively). Note that we have to condition all the PDFs to some set of 
parameters p which encode our prior knowledge. Bayes theorem yields the poste- 
rior: f(s|d,p) — ji^f(s|p)p(d'^p) ' short: Posterior^priorxhkehhood/evidence. The 
evidence which can be important for model comparison and selection can be simply 
considered as a normalization constant for our purposes. 



2.1 Bayesian inference steps 

From Bayes theorem we can already extract the necessary ingredients to perform a 
Bayesian analysis. First the prior and the Ukelihood have to be defined to find an 
expression for the posterior PDF. From the posterior one may obtain an estimate of 
the signal either computing the maximum or sampling the full posterior PDF. Here 
we enumerate the different steps: 

1 . Definition of the prior: knowledge of the underlying signal 

2. Definition of the likelihood: nature of the observed data 

3. Linking the prior to the likelihood: link between signal and data 

4. Bayes theorem: definition of the posterior 

5. Maximization of the posterior: MAP 

6. Sampling the posterior: MCMC 
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2.2 Definition of the prior: knowledge of the underlying signal 



The prior distribution function describes the statistical nature of the signal we want 
to recover from degraded measured data. In our case we want to obtain the three 
dimensional map of the large-scale structure represented by the matter over-density 
field 5m- For computational reasons we choose an equidistant grid with cells 
which permits us to use fast Fourier transforms. 



2.2.1 Gaussian prior: cosmic variance and cosmological parameters 

The most simple PDF to characterize a cosmic field with a given power-spectrum is 
represented by the Gaussian distribution f2l: 



'''^'■"' V(2»^e,(S.) ""' 
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(1) 



with p being the set of cosmological parameters which determine the auto-correlation 
matrix S5 = (5m^5m) or its Fourier transform, the power-spectrum ^^(k) (where 
k is the ^-vector in Fourier space). We know however that the matter statistics is 
skewed due to gravitation. We need non-Gaussian models to better characterize the 
matter field. 



2.2.2 Non-Gaussian priors 



The Gaussian distribution function can be expanded with the Edgeworth expansion 
O. However, this is only valid for moderate non-Gaussian fields. One can instead 
make a variable transformation of the Gaussian variable and apply the lognormal 
assumption |4|. Such a distribution function may also be expanded leading to very 
accurate fits in the univariate matter statistics comparing to A^-body simulations Q. 
Let us introduce the field ^ which has zero mean by definition for each cell 



<Pi = \npi - (Inp) = ln(l + 5m,) - Mi- 
Then the multivariate Edgeworth expansion is given by [6J : 



(2) 



=G(0) 



i'j'k' 
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with G{<P) being a Gaussian PDF with zero mean and variance S = {<P^(P) for the 
variable <P, {<Pji <P f <Pi^i) and {<l>ji<Pji<Pi^i<Pii)c the third and fourth order cumulants, 
and with hjjii and liiji^i being the third and fourth order Hermite polynomials. 



2.2.3 Lognormal model 

The multivariate lognormal model (f (5m|S) = G{<P)) is given by Q: 
Note that this PDF converges to the Gauss distribution when |5m| ^ 1- 



2.3 Definition of the likelihood: nature of the observable 

A galaxy sample represents a discrete biased sample of the underlying matter field. 
Its distribution can be sub- or super-Poisson depending on local and non-local prop- 
erties |8 9 10|. Based on a discrete version of the Press-Schechter formalism ifTTl 
found a Borel distribution. Another non-Poisson distribution was found in the con- 
text of a thermodynamical description of gravity lfT2l [131 . In llT4ll it is shown that 
both distribution functions can be identical under certain assumptions. Let us write 
the gravitothermal dynamics distribution function generalized to have a scale de- 
pendent parameter Q: 

pm,Q)-n ^^%f''^'' (5) 

^ (U^Ki - + £ei-,™A^m) exp [- YJk^I^ - Qk,n)K - E Qk-oN^l 

\ I m j \ n o / 

Note that this PDF simplifies to the Poisson distribution when Q is zero. 



2.3.1 Poisson limit 

For a sparse sample we can assume a Poisson distribution ifTSl : 

^ A/^*exp(— Ai.) 

P(N|A) = n ' l\ ■ (6) 
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Note that usually only the Poisson variance has been considered in the context of 
large-scale structure reconstructions lfT6l [TTl [TSl . The full treatment was introduced 
by imill and applied to the Sloan Digital Sky Survey ll20l . 



2.4 Link between the prior and the likelihood 

The link between the observed/measured data and the signal is usually not trivial 
and needs to be modeled to find the posterior distribution function. In particular 
we seek a relation between the expected number counts X and the signal we want 
to recover 5m In our case we have three main complications: the galaxy bias, the 
completeness of the survey and the uncertainties in the redshift positions. 

2.4.1 Galaxy bias 

The relation between the galaxy 5g and matter 5m density fields is non-local and 
non-linear [21] . Let us write such a general relation as: 

5g,=B(5M),-, (7) 
One may parametrize this relation expanding the density field as in ||22|: 

j 

here generalized to be non-local with the scale dependent bias parameters B]j, Z??-, 

Non-local transformations of the density field should be further investigated. 

Here one may incorporate the halo model into the Bayesian framework (see the 
recent works on halo model based reconstructions by Il23ll24ll '). 

2.4.2 Response operator 

The response operator R should encode the sky mask, radial selection function and 
may even encode the uncertainty in the redshift position of galaxies. In general such 
a relation is not trivial: 

A, = A,(5M)=-R(5g(5M))i. (9) 
If we focus our attention to the completeness w then we can write: 

h = WiN{l+B{5u)i). (10) 

with .^V being the mean number of galaxies in the observed volume. Assuming a 
linear bias relation b this expression is reduced to: 



6 



Francisco-Shu Kitaura 
(11) 



2.5 Bayes theorem: the posterior 

Armed with the prior, the likelihood and the link between both we can apply Bayes 
theorem to obtain the posterior PDF. A general expression for such a posterior PDF 
can be obtained by plugging in what we have discussed in previous sections (an 
expanded lognormal prior and a non-Poissonian likelihood); 

^(5m|N,S) 

°^ |n TT^^'^P (-^ E (1" (1 + 5m,) - i"/) s;^' (In (1 + 5m,,) - M./)^ I 

X + ^ E (0,0,0,,)cE5^/%!/\;%.(s-'/^^) 

I i'j'k' ijk 

^- i'j'k'l' ijkl ) 



X exp (-E(5l„ - Qk,n)yvnN(l +B(5m)«) -Y^Qk-oNr^j | , (12) 
If we assume a lognormal prior, a Poisson likelihood and a linear bias relation we 

getcma : 

f(0|N,S)ocG(0) (13) 
(w,^ (1 + (exp (0, + - l))f * exp (-w,^ (1 + /7 (exp (0, + - 1))) 



n 



Nkl 



where we have used the lognormal transformation relating the nonlinear density 
field 5m to its Gaussian component through 5Mi — exp (0, -f ^) — 1. 
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Once we have an analytical expression for the posterior distribution function we can 
compute the maximum of that distribution (MAP). The MAP solution for the signal 
s is obtained by searching for the extrema of the energy E{s) = — In (f (s|d,p)): 

dE(s) 

^^=0, (14) 

Here efficient schemes are crucial to deal with large cell numbers on which the 
density has to be computed. Iterative schemes have been shown to cope with this 
problem |T9ll7ll. 



2.7 Sampling the posterior 

Alternatively one may want to sample the full posterior distribution. Until now 
we have assumed that the power-spectrum is known and that the data have been 
previously converted to real-space correcting for redshift distortions. However, it is 
desirable to consistently estimate the peculiar velocity field v and relax the depen- 
dence on the cosmological model by jointly sampling the power-spectrum. This can 
be done splitting the full problem into simpler ones with conditional probability dis- 
tribution functions. In particular with the Gibbs-sampling scheme one can sample 
from the joint PDF P(5m, v, S |d^) of the matter density field 5m, the peculiar veloc- 
ity field V and the covariance (or power-spectrum) S given some nonlinear data in 
redshift space d'" as follows 

^a+i) ^P(0|v(;)^S,d"), (15) 
S^+i) ^P(S|0(^'+i)), (16) 
v(j+i) ^p(v|0(.'+i)), (17) 

with the arrows standing for the corresponding sampling process 

[19,1L25J. 



First, the matter density field (Eq. 15 1 can be sampled with the Hamiltonian sam- 
pling scheme 1321 [33l [34l l25l under the Gaussian prior assumption for the variable 
and encoding the lognormal transformation between the linear and the nonlinear 
density fields (5m, = exp - 1) in the likelihood ||35] |36l |32l . Second, the 



power-spectrum corresponding to <P (Eq. 16 1 can be consistently sampled with the 



inverse Gamma distribution function |19, 31 1. Finally, the peculiar velocity sam- 
pUng (Eq. [T7| which permits us to do the mapping between real- and redshift-space 
can be done with Lagrangian perturbation theory from the Gaussian component of 
the density field L38.,.39..19...25.|. 
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Here we demonstrate the numerical computation of the multivariate non-Gaussian 
matter field statistics and its power-spectrum. We will restrict ourselves in this work 
to the lognormal prior and the Poisson likelihood. It was shown in ||25l how to 
sample the power-spectrum consistently with this model. However, we will show 
here that even with a fix prior for the power-spectrum one can extract the underlying 
features and the correct shape of the power-spectrum since the dependence on the 
prior becomes sub-dominant in the presence of good enough data as it is provided 
with present galaxy redshift surveys. 



3.1 Setup 

We construct the mock observed data taking a random sub-sample of the particles 
in the Millennium run at redshift zero H] which was gridded on a 128^ mesh. Our 
setup permits us to avoid the biasing problem in our tests. Note that we also avoid 
the redshift distortions by considering the dark matter particles in real-space. The 
mocks were generated with a radial selection function using an exponential decay- 
ing model of completeness w ||40| . The final mock galaxy samples have 350961 
particles. The observer was set at the center of the box, i.e. at coordinates: X=250 
h^^ Mpc, Y=250 /j^' Mpc, and Z=250 Z;^' Mpc. We calculate the power-spectrum 
Pg (k) which determines the covariance matrix S with a nonlinear fit which also de- 
scribes the effects of virialized structures including a halo term as given by |,4JJ at 
redshift z = Q. We apply the Hamiltonian scheme with the ARGO-code lfT9l |7] [25] 
to sample the full posterior distribution function. 



3.2 Results 

Our results show the evolution of the density samples as the number of iterations 
increases together with its corresponding matter statistics and power-spectra (see 
Fig. [If. 

We find that on scales of about 4 h ' Mpc the over-dense regions are excellent re- 
constructed, however, under-dense regions (void statistics) are quantitatively poorly 
recovered (compare dark blue and green curves in panel g). Contrary to the maxi- 
mum a posteriori (MAP) solution which was shown to over-estimate the density in 
the under-dense regions ||2J we obtain lower densities than in A^-body simulations. 
This is due to the fact that the MAP solution is conservative whereas the full pos- 
terior yields samples which are consistent with the prior statistics. The lognormal 
prior is not able to capture the full non-linear regime at scales below ^ 10 Mpc 
for which higher order correlations would be required to describe the matter statis- 
tics. However, we confirm as it was recently shown in the context of Lya forest 
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tomography that the Poisson-lognormal model provides the correct two-point statis- 
tics or power-spectrum. Please note, how the power-spectrum (green curve in panel 
f) of the converged samples are similar to the underlying power-spectrum (dark blue 
curve in panel f) and differ from the prior power- spectrum (red curve). 



4 Discussion 

We have presented the Bayesian approach to infer density fields and power-spectra 
in the context of large-scale structure analysis from non-Gaussian distributed data. 
Although the results are very encouraging especially for matter field estimations 
in high-density regions and power- spectrum estimation some of the models need to 
be revised to get a more detailed characterization of the large-scale structure on 
small scales. The lognormal assumption leads to quantitatively wrong estimates 
in under-dense regions at scales below 10 Mpc. At those scales higher order 
correlation functions start to become relevant. We have shown how this could be 
modeled with a multivariate Edgeworth expansion. However, the problem of such 
an approach is that one would need additional models for the higher order correla- 
tion functions introducing hereby more parameters. A different ansatz based on a 
physical approach would be required to solve this problem. Focusing on the Gaus- 
sian component of the density field and encoding the nonUnear transformation in the 
likelihood is a very promising approach as it radically simplifies the problem. We 
have addressed other issues like the non-Poisson character of the galaxy distribution 
and how this could be implemented in a Bayesian context. Although the Bayesian 
techniques available are powerful enough to deal with complex problems we think 
that much more effort has to be done in this direction by studying the large-scale 
structure from simulations and extracting precise statistical models. 
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